ABSTRACT: As providers of higher education begin to harness the power of big data analytics, one 
very fitting application for these new techniques is that of predicting student attrition. The ability to 
pinpoint students who might soon decide to drop out, or who may be following a suboptimal path to 
success, not only allows those in charge to understand the causes of this undesired outcome, but 
also provides room for the development of early intervention systems. While making such inferences 
based on academic performance data alone is certainly possible, we claim that in many cases there is 
no substantial correlation between how well a student performs and his/her decision to withdraw. 

This is especially true when the overall set of students has a relatively similar academic performance. 

To address this issue, we derive measurements of engagement from students' electronic portfolios 
and show how these features can be used effectively to augment the quality of predictions. 

KEYWORDS: Electronic Portfolios, Student Retention, Early Intervention, Data Fusion, Learning 
Analytics, Predictive Analytics 

1. INTRODUCTION 

Over the course of many years, the education field has gone through several transformations. As new 
techniques for both teaching and assessing students emerge, universities and other post-secondary 
institutions are expected to adapt quickly and begin to follow the new norms. Further, as the needs of 
our society shift, we often see increased demands for professionals in particular disciplines. Most 
recently, this phenomenon can be observed with respect to the areas of Science, Technology, 
Engineering, and Mathematics (STEM). 

While creating an environment that stimulates student enrollment in these particular fields is a 
challenge in itself, preserving high retention rates can be a far more complicated task. As Seidman 
(2005) highlights, our understanding of retention has considerably changed over time, and efforts to 
address the issue are ubiquitous in higher education today. Yet, despite the rapid growth of this subject 
over the last few years, there are clear indications that the complexities involved with helping a highly 
diverse array of students to succeed are far from being understood. 

It is estimated that nearly half of the students who drop out of their respective programs do so within 
their first year in college (Delen, 2011). Consequently, a clear focus has been directed towards early 
identification and diagnosis of at-risk students, and a variety of studies using statistical methods, data 
mining, and machine learning techniques can be found in recent literature (Thammasiri, Delen, Meesad, 
& Kasap, 2013; DeBerard, Spielmans, & Julka, 2004; Zhang, Anderson, Ohland, & Thorndyke, 2004; 
Burtner, 2005; Yu, DiGangi, Jannasch-Pennell, Lo, & Kaprolet, 2007; Mendez, Buskirk, Lohr, & Haag, 
2008; Li, Swaminathan, & Tang, 2009; Lin, Imbrie, & Reid, 2009; Delen, 2010; Zhang, Oussena, Clark, & 
Hyensook, 2010). 


A downside of these proposed models is that they frequently rely strictly on academic performance, 
demographic, and financial aid data. There is wide recognition, however, that the reasons for student 
dropouts can stem from several other factors outside that scope (Astin & Astin, 1992; Pascarella, 
Terenzini, & Blimling, 1994; Xenos, Pierrakeas, & Pintelas, 2002; Flowers, 2004; Pascarella, 2006; 
MacGregor & Leigh Smith, 2005; Li, Swaminathan, & Tang, 2009). Moreover, a number of students who 
are not retained do not exhibit any early signs of academic struggle as per their grades. The inverse is 
also true, as there are often highly engaged students who despite performing below expectations, 
remain enrolled. Figure 1 illustrates these two specific groups of students. 


In this paper, we focus on remedying the shortcomings that arise when classification models are trained 
using only student academic performance and demographic data. We collected data that describe the 
access patterns of first-year engineering students to their personal electronic portfolios, which are 
dynamic web-based environments where students can list and describe their skills and achievements, 
and we show how these features correlate to and can help enhance the prediction accuracy of student 
attrition. In particular, we investigate how measurements of student engagement can be used to 
decrease misprediction rates of instances belonging to the groups highlighted in Figure 1. 


The remaining portion of this paper is organized as follows. Section 2 gives an overview of the most 
recent related literature. Section 3 describes the context in which this study was carried out and gives 
insight as to our decision to utilize electronic portfolios to measure student engagement. Section 4 
expands that idea by highlighting some preliminary observations on the usefulness of ePortfolios. 
Next, Section 5 describes our dataset in detail. The methodology and experimental results are 
covered in Sections 6 and 7 respectively, and a brief discussion of our findings concludes this paper in 
Section 8. 

Figure 1: Low performing/highly engaged students (quadrant I) are often retained. High 
performing/disengaged students (quadrant IV) may not persist. 


2. RELATED WORK 


From a sociological standpoint, student attrition has been studied in great detail. Seidman (2005) and 
Tinto (2012) provide comprehensive studies that investigate the causes and consequences of this issue. 
Though related, this branch of the literature is outside the scope of this paper. Below, we 
provide a more elaborate description of the most recent works that utilize student data to create and 
evaluate prediction models for student attrition. 

Early work by DeBerard et al. (2004) combined academic performance, demographics and self-reported 
survey data of students in an attempt to forecast cumulative GPA using linear regression, and retention 
rates via logistic equations. The former achieved commendable results while the outcomes of the latter 
were not statistically significant. Around the same time, a study by Zhang et al. (2004) showed that high 
school GPA and math SAT 1 scores were positively correlated to graduation rates of engineering 
students, while verbal SAT scores correlated negatively with odds of graduation. Similar findings are 
reported by Mendez et al. (2008). 

A key premise of our work is highlighted by Burtner (2005). After monitoring a group of incoming 
engineering students over a three year period, the author concludes that while a predictive model based 
on cognitive variables such as the students' math and science ability can perform relatively well, it would 
greatly benefit if non-cognitive factors developed during the freshman year were to be incorporated. Lin 
et al. (2009) validate that idea by showing that the accuracy of their classification models improves after 
the inclusion of non-cognitive features extracted from a survey. 

1 A standardized test widely used for college admissions in the United States 

Yu et al. (2007) utilize decision trees to predict student retention, and among other discoveries, the 
authors report that in their context, student persistence was more closely related to the students' 
residency status (in/out of state) and current living location (on/off campus) than it was to performance 
indicators. Likewise, a sensitivity analysis exercise performed on neural networks, decision trees, 
support vector machine and logistic regression models by Delen (2010; 2011) ultimately concluded that 
several important features utilized to predict student retention were not related to academic 
performance. 

With the intent of developing a long-term intervention system to enhance student retention, Zhang, 
Oussena, Clark, & Hyensook (2010) tested three different classifiers and observed that the best 
prediction accuracy for student retention was yielded by naive Bayes. Alkhasawneh (2011) utilizes neural 
networks to predict first year retention and provides an extensive analysis of his models' performance. 
Finally, we highlight the recent work by Thammasiri et al. (2013), in which the problem of predicting 
freshmen student attrition is approached from a class imbalance perspective, and the authors show how 
oversampling methods can enhance prediction accuracy. 

3. CONTEXT 

3.1 The College of Engineering 

The University of Notre Dame is a medium-sized, Midwestern, private institution with a traditional 
student composition, i.e., the vast majority of students complete their undergraduate studies in four 
years and are in the age range of 18 to 22. The overall student body is 53 percent male and 47 percent 
female, while the College of Engineering is approximately 75 percent male and 25 percent female. First- 
year students are admitted to the First-Year of Studies program regardless of their intended future 
major. Students select their major (whether engineering or something else) near the end of their first 
year when they register for classes for the upcoming fall semester. Beyond admission/selection into the 
university as a whole, there are no admission or selection criteria for entering any of the disciplines of 
engineering; rather, it is based on student interest alone. 

With few exceptions, first-year students considering an academic pathway within engineering complete 
a standard first-year curriculum, including the two-semester course sequence of "Introduction to 
Engineering," taught within the College of Engineering. Each year the course sequence has enrollments 
of approximately 450 to 550 students. The course has two main objectives: (1) to expose students to the 
engineering profession and engineering major options, and (2) to demonstrate the processes of 
planning, modelling, designing, and executing specified project deliverables. The course curriculum uses 
a project-based learning approach, with students completing a total of three group projects across the 
two-semester sequence. Students are required to attend large lecture sections, which introduce basic 
concepts needed to complete the projects, and small group (30 to 35 students) learning centres that 
focus on hands-on learning. For over a decade, the course sequence has included similar material and 
project-based course assignments, including homework, quizzes, exams, technical reports, and 
presentations. 

3.2 ePortfolios for Engagement 

The ePortfolios serve as a creative space and a recording system that utilizes digital technologies to 
allow learners to collect artifacts and examples of what they know and can do, in multiple media 
formats, using hypertext to organize and link evidence to appropriate outcomes/skills, goals, or 
standards (Barrett, 2007). ePortfolios capture and document student learning and engagement through 
their reflection, rationale building, and/or planning. Chen and Black (2010) found ePortfolios generate 
shared responsibility and ownership of learning between students and instructors since they can be 
used inside and outside the classroom. They are also available and can be used on and off campus, in 
face-to-face and virtual environments, and during and after the student's time in college (as a way of 
practically demonstrating what ABET (Accreditation Board for Engineering and Technology, 2013) refers 
to as "life-long learning" achievements). Al-Atabi et al. (2011) found the use of ePortfolios to be valuable 
as an advising tool, allowing students to track the progress of their learning outcomes, to provide 
documentary evidence, and to be used when they meet regularly with their academic advisors for 
feedback. Significantly, the use of ePortfolios generates intentional and active learners since students 
become self-aware and take ownership of their academic progress. 

Higher education institutions such as Bowling Green State University (Knight, Hakel, & Gromko, 2008), 
La Guardia Community College (Eynon, 2009), University of Barcelona (Lopez-Fernandez & Rodriguez-Illera, 
2009), Spelman College (Price, 2006), Clemson (Ring, Weaver, Jones, & others, 2009), Penn State, 
and Florida State Universities (Yancey, 2009) have begun to implement ePortfolio initiatives to enhance 
engagement and measure impact through integrating life-wide academic, personal, and professional 
contexts. Kahu (2013) points out that because of its complex nature, finding a common definition for 
student engagement across the several fields that study it is a difficult task. We subscribe to 
Kuh et al.'s (2000) view of student engagement as a construct that measures the alignment between what 
effective institutions purposefully do (a range of teaching practices and programmatic interventions) to 
induce and channel students to desired outcomes, compared with what students actually do with their 
time and energy towards achieving these educationally purposeful activities (Kuh, Hu, & Vesper, 2000). 
More specifically, using Kahu's (2013) behavioural dimension of engagement, we focus on the digital 
observations, or behavioural measurements, of online time on task, effort, and participation. 

The ePortfolio platform of our choice is supported by Digication (Digication e-Portfolios, 2013) and its 
Assessment Management System (AMS). Their paid subscription account not only offers an ePortfolio 
platform but also provides a powerful back-end course, program, institution, or inter-institution AMS. 
Within individual, and across our partnering institutions, the AMS tracks, compares, and generates 
reports on student progress and performance by standards, goals, objectives, or assignments. 


Realizing the importance of having a deep understanding of how students interact with and make use of 
their electronic portfolios, we worked alongside Digication to develop an automated pipeline for data 
collection that has since been deployed. We collect student-level data on a daily basis, and while the 
work described in this paper made use of aggregated datasets collected at the end of the 2012 fall 
semester, in future we will be able to scale our models for predicting early signs of risk to much more 
actionable time stamps. 

3.3 ePortfolio Engagement as an Analytic 

For too long much of the emphasis on improving retention has focused solely on the binary metric of 
retention (yes/no). By focusing on student engagement rather than just predictive variables, after-the- 
fact outcome of retention, or a subjective measurement of learning, the ePortfolio provides a window 
into the time, energy level, and commitment exhibited by students throughout the trajectory of a given 
course. The assessment focus on retention is too late to provide an actionable window and improve 
learning within a course, especially during the first semester of college. 


An ePortfolio engagement analytic has important implications for the emerging field of learning 
analytics. Johnson et al. (2010) define learning analytics as the interpretation of a wide range of data 
produced by and gathered on behalf of students in order to assess academic progress, predict future 
performance, and spot potential issues. The goal of learning analytics is to enable educators to 
understand and optimize learning via an environment tailored to each student's level of need and ability 
in close-to-real time. Up until now, most of the data sources have been limited to learners' tacit digital 
actions inside the learning management systems (e.g., discussion forum posts, downloading content to 
read, login rates, and duration). The ePortfolio tool and platform offers a more authentic environment 
that could provide a week-by-week measure to identify if and when students are losing engagement and 
explore why, where, and what is engaging students as well as how they spend their time and energy 
outside of the class. Therefore, data mining the ePortfolios could generate more effective learning 
analytics to improve the understanding of teaching and learning, and to tailor education to individual 
students more effectively. 

3.4 ePortfolio Use in the First-Year Engineering Course 

In the 2012-2013 academic year, ePortfolio assignments were integrated with the traditional course 
deliverables as a means to guide student reflection on their education. Eleven ePortfolio updates were 
assigned throughout the academic year. For the course, all students were required to create an 
ePortfolio following an instructor-designed template. The ePortfolio template included three main 
sections, which were each updated over the course sequence: 

1 Engineering Advising: Required reflection on their engineering major choice and their progress 
towards engineering skill areas. Seven skills areas were defined, each relating to ABET 
accreditation-required outcomes (a-k). 

2 Project Updates: Required updates following the completion of each project. Minimally, 
students were asked to include a picture of their project and a reflection on skills developed 
through the project. 

3 Engineering Exploration: Required reflection after attendance at eight engineering related 
events that took place outside of the course. Events included seminars, engineering student 
group meetings, professional development activities, et cetera that were delivered by various 
groups within the university. 



Figure 2: ePortfolio examples highlighting each of the three categories mentioned above. 

Although ePortfolio assignments were a required portion of the course, they were graded largely for 
completion. Therefore, student effort towards their ePortfolio assignments had wide variability. In 
addition, students were encouraged to personalize their ePortfolios to include additional pages and 
information not required by the course. Because students were asked to share this ePortfolio with their 
advisors after matriculating into engineering departments, they were encouraged to keep any additional 
content professional in nature. 

Goodrich et al. (2014) describe in detail how electronic portfolios have been incorporated into our 
Introduction to Engineering curriculum, and how they can potentially be used to measure student 
engagement. The paper contrasts instructor-generated ratings for student engagement with metrics 
extracted from those student ePortfolios, and it shows that for that particular cohort, the engagement 
estimates provided by the ePortfolio variables were significantly more strongly correlated to retention 
outcomes than were the instructor ratings. Given this particular context of how ePortfolios are utilized, 
we believe that an important correlation between the students' engagement in using this tool and 
retention levels exists and can be potentially mined for predictive analysis. 

3.5 ePortfolio Use across the First Year 


At the time of this study, the University of Notre Dame was two years into a larger University ePortfolio 
initiative where half of the students on campus were using ePortfolios in at least one capacity. The 
College of First Year Studies had launched its own initiative for all first-year students, to flip and enhance 
the advising process of their one-on-one sessions with the ePortfolio. Using the blended advising model 
(Ambrose & Williamson Ambrose, 2013), the Advising ePortfolio was used to pre-engage students by 
asking them to plan and list goals in their ePortfolio before their face-to-face advising session. This 
would lead to a more deeply engaged one-on-one advising interaction because the students came 
prepared. Then advisors and students would have a platform to re-engage and review progress and 
growth over the year. For a detailed overview of this university-wide ePortfolio initiative, see Ambrose, 
Martin, and Page (2014). 

In addition, a small cohort of 12 first-generation students (from a family in which no parent or guardian 
has earned a baccalaureate degree) who intended to study engineering were also enrolled in a 
one-credit Independent Self-Study Advising Seminar as part of an invited scholars program for 
underrepresented students. 

While section 5 will provide more detail on our datasets and describe each of the features we analyze, a 
preliminary case study illustrated by Figure 3 showed that this select group of students who were more 
exposed to electronic portfolios as part of the previously mentioned enhanced academic program 
exhibited markedly higher levels of engagement using that tool, and more importantly, was retained in 
its entirety. The figure below points to a possible transference and increased engagement effect when 
students are exposed to a more integrative, across time and other contexts, ePortfolio experience. In 
addition, it may also point to the engagement potential, particularly for a cohort of first-generation and 
underrepresented students. 



Figure 3: The transference effect of ePortfolio engagement on a select first generation student 
sample. Legend: all first year engineering students; select first generation students who were part of 
the academic development program. 

4. PROBLEM, GOAL & QUESTION 

As previously mentioned, factors providing early indications that a student may be at risk can come from 
a variety of categories. Data that has been aggregated over time can offer insights as to which 
sub-populations of a student body are more likely to need closer attention, whereas individual student-level 
performance data can be used to flag students who are beginning to diverge from their optimal path to 
success. 

An important task, then, is to decide which information to track for each student so that we can most 
efficiently generate early risk warnings when these are warranted. As we claimed earlier, placing 
disproportional focus on academic performance data can result in warning systems that may fail to 
identify students losing interest and disengaging from school when these shortcomings are not reflected 
in that student's performance. 

To illustrate, take for instance the performance data summarized in Figure 4. There we can see how the 
first semester cumulative GPAs for both retained and not retained students in our dataset were 
distributed. While it does appear that the retained students perform, on average, slightly better, we 
note that such disparity is not very substantial. Furthermore, we can see that a fairly sizable number of 
students who were not retained concluded their first semesters with very high GPAs (e.g., > 3.0), which 
is a reasonable indication that they would have been likely to succeed as engineering students had they 
chosen to remain enrolled in the program. A similar observation can be made when we look at the 
course grades for Introduction to Engineering (EG 111) alone. 



Figure 4: Academic performance breakdown by retention outcome. 


ISSN 1929-7750 (online). The Journal of Learning Analytics works under a Creative Commons License, Attribution - NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0) 



(2014) Engagement vs Performance: Using Electronic Portfolios to Predict First Semester Engineering Student Persistence. Journal 
of Learning Analytics, 1 (3), 7-33. 

With that premise in mind, this paper attempts to investigate the viability of using ePortfolios as a proxy 
for measuring student engagement. Again, splitting the same cohort of students depicted in Figure 4 into 
the subgroups of retained and not retained, we see in Figure 5 that on average, students in each of 
these two groups appear to invest discernibly different amounts of time and energy when interacting 
with their portfolios. 



Figure 5: Average breakdown of ePortfolio engagement activity. 

While Goodrich, Aguiar, Ambrose, Brockman, & Chawla (2014) present a more elaborate description of 
this information and a comparison between these observed values and similar engagement estimates 
generated by course instructors, we highlight that the distribution of values for each of the three 
features illustrated in Figure 5 (i.e., number of logins during the fall semester, number of evidences 
submitted through ePortfolio, and number of ePortfolio page hits received) with respect to retained and 
not retained students is statistically significantly different, with p-values displayed in Table 1. 

Table 1: P-values of observed distributions for three ePortfolio features by retention outcomes. 

ePortfolio Feature            | P-value 
Number of logins              | < 0.005 
Number of evidences submitted | < 0.05 
Number of page hits           | < 0.005 
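
The test used to obtain the p-values in Table 1 is not stated here. As an illustration of how such a group comparison could be run, the following minimal sketch applies a two-sided permutation test to the difference in means between retained and not retained students. The function name, the login counts, and the choice of test are all assumptions made for this example and are not taken from the study.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign students to the two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical login counts, for illustration only.
retained_logins = [31, 28, 35, 40, 27, 33, 38, 29, 36, 34]
not_retained_logins = [12, 18, 9, 15, 20, 11]

p = permutation_p_value(retained_logins, not_retained_logins)
print(f"p-value: {p:.4f}")
```

A small p-value here would only indicate that the two made-up groups differ in their mean login counts; it says nothing about the actual cohort.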

5. DATASET 

This study used data collected from a single cohort of incoming freshmen students registered in a first 
semester Introduction to Engineering course. This particular group was made up of 429 students, the 
vast majority of whom had engineering majors listed as their first year intent and remained in the 
program for the subsequent semester, leading to a very imbalanced dataset. While majors are not 
formally declared until their sophomore year, students are asked to state their intended majors when 
submitting their application package and prior to their first semester on campus. 

5.1 Description 

A variety of features that describe each student's academic performance, engagement, and 
demographic background were made available to this project from multiple sources. These were then 
matched student-wise and merged into a single dataset. After an initial analysis of the data, we decided 
to exclude a number of features that either (1) had no apparent correlation to the outcome variable, (2) 
directly implied it, or (3) provided redundant information. Further, we also removed 10 instances that 
had a very considerable amount of missing data. These particular instances corresponded to students 
who dropped out early in the semester and hence had no academic performance or engagement data 
available. Table 2 lists and describes each feature available in our final dataset and Table 3 groups these 
into their respective categories. 
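
The matching and filtering steps described above could look roughly like the sketch below. The student identifiers, field names, and the missing-data threshold are hypothetical, and the helper functions are illustrative rather than the tooling actually used in this project.

```python
def merge_sources(*sources):
    """Merge per-student records from several sources, keyed by student ID."""
    merged = {}
    for source in sources:
        for student_id, record in source.items():
            merged.setdefault(student_id, {}).update(record)
    return merged


def drop_sparse(records, required, max_missing):
    """Keep students missing at most `max_missing` of the required fields."""
    kept = {}
    for student_id, record in records.items():
        missing = sum(1 for field in required if record.get(field) is None)
        if missing <= max_missing:
            kept[student_id] = record
    return kept


# Hypothetical records, for illustration only.
academic = {"s1": {"sem1_gpa": 3.4}, "s2": {"sem1_gpa": 2.9}, "s3": {}}
engagement = {"s1": {"eport_logins": 30}, "s2": {"eport_logins": 12}}
demographic = {"s1": {"gender": "F"}, "s2": {"gender": "M"}, "s3": {"gender": "M"}}

dataset = merge_sources(academic, engagement, demographic)
required_fields = ["sem1_gpa", "eport_logins", "gender"]
clean = drop_sparse(dataset, required_fields, max_missing=1)
print(sorted(clean))
```

In this toy example, student `s3` has neither performance nor engagement data and is therefore dropped, mirroring the early-withdrawal cases mentioned above.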


Table 2: Dataset features described 

Name | Description | Type 
Adm Intent | Intended college major as specified by the student in his/her application package | Nominal 
Adm Type | Type of admission earned by student (e.g., early, regular, waiting list) | Nominal 
AP Credits | Number of credits earned through AP courses taken prior to college enrollment | Numeric 
Dormitory | Name of dorm where the student resides (all first-year students are required to live on campus) | Nominal 
EG 111 Grade | Letter grade obtained in the introduction to engineering course | Numeric 
ePort Hits | Hit count for the student's ePortfolio pages during the fall semester | Numeric 
ePort Logins | Number of times the student logged in to his/her ePortfolio account during the fall semester | Numeric 
ePort Subm | Number of assignments submitted via ePortfolio during the fall semester | Numeric 
Ethnicity | The student's self-declared ethnicity | Nominal 
First Gen | A flag to denote first-generation college students | Binary 
FY Intent | Intended college major as specified immediately prior to the beginning of the fall semester | Nominal 
Gender | The student's gender | Binary 
Income Group | A numeric value ranging from 1-21, each corresponding to a different income group segment | Numeric 
SAT Comb | Combined SAT scores | Numeric 
SAT Math | SAT score for the math portion of the test | Numeric 
SAT Verbal | SAT score for the verbal portion of the test | Numeric 
Sem 1 GPA | The student's overall GPA at the end of the fall semester | Numeric 
Retained | A flag identifying students that left the program immediately after the fall semester | Binary 


It is worth noting that this particular dataset has a highly imbalanced class distribution wherein only 11.5 
percent of the instances belong to the minority class (student not retained). As described in Thammasiri 
et al. (2013), predicting student retention becomes more challenging when the available training sets 
are imbalanced because standard classification algorithms usually have a bias towards the majority 
class. 
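
To see why imbalance is a concern, consider a hypothetical cohort of 1,000 students with the same 11.5 percent minority share. The sketch below (the function name and labels are our own, for illustration) scores a trivial model that predicts every student as retained.

```python
def accuracy_and_minority_recall(y_true, y_pred, minority_label):
    """Return overall accuracy and recall on the minority class."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    minority = [(t, p) for t, p in pairs if t == minority_label]
    recall = sum(t == p for t, p in minority) / len(minority)
    return accuracy, recall


# Hypothetical cohort: 115 of 1,000 students are not retained (11.5 percent).
y_true = ["not retained"] * 115 + ["retained"] * 885
# A trivial model that always predicts the majority class.
y_pred = ["retained"] * len(y_true)

accuracy, recall = accuracy_and_minority_recall(y_true, y_pred, "not retained")
print(f"accuracy = {accuracy:.3f}, minority recall = {recall:.3f}")
```

The trivial model reaches an accuracy of 0.885 while identifying none of the students who leave, which is why accuracy alone can be a misleading measure on data of this kind.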

5.2 Feature Selection 

As a second step to preparing our dataset, we carried out a series of tests to investigate how strongly 
correlated to the outcome each feature was. In general, performing feature selection as a means for 
reducing the feature space provides some benefits when building classification models. Namely, the 
model becomes more generalizable and less prone to overfitting, more computationally efficient, and 
easier to interpret. 

The following feature selection methods were used: information gain (IG) (Quinlan, 1986), gain ratio 
(GR) (Quinlan, 1993), chi-squared (CS), and Pearson's correlation (CR). The first evaluates the worth of 
each attribute by measuring its information gain with respect to the class. Gain ratio works in a similar 
manner while adopting a different metric. CS and CR compute chi-squared and Pearson's correlation 
statistics for each feature/class combination. 
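
As a rough illustration of the first two measures, the sketch below computes information gain and gain ratio for a discrete feature, following the standard entropy-based definitions. The toy feature and labels are invented for this example, and numeric features would first need to be discretized, a step not shown here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0
    return information_gain(feature_values, labels) / split_info


# Toy data, for illustration only: R = retained, N = not retained.
labels = ["R", "R", "R", "R", "N", "N", "R", "R"]
hits_level = ["high", "high", "high", "high", "low", "low", "low", "low"]

ig = information_gain(hits_level, labels)
gr = gain_ratio(hits_level, labels)
print(f"IG = {ig:.4f}, GR = {gr:.4f}")
```

Gain ratio divides by the split entropy so that features with many distinct values (a dormitory name, for example) are not favored merely for fragmenting the data.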

The results of our experiments are summarized in Table 4, where the top 10 features ranked by each 
method are listed, and the highest correlated one is highlighted for each column. 
Table 3: Dataset features categorized. 

Type | Subtype | Features 
Academic | Pre-Adm | Adm Intent, SAT Math, SAT Verbal, SAT Comb, AP Credits 
Academic | Post-Adm | FY Intent, EG 111 Grade, Sem 1 GPA, Retained 
Demographics | | Adm Type, Gender, Ethnicity, Income Group, First Gen, Dormitory 
Engagement | | ePort Logins, ePort Subm, ePort Hits 

Table 4: Feature selection rank. 

Feature      | IG | GR | CS | CR 
Adm Intent   | 2  | 4  | 2  | - 
SAT Math     | 7  | 2  | 5  | 3 
SAT Verbal   | -  | -  | -  | - 
SAT Comb     | 9  | 10 | 9  | - 
AP Credits   | 8  | 8  | 8  | 8 
FY Intent    | 5  | 7  | 6  | 9 
EG 111 Grade | 4  | 5  | 3  | 6 
Sem 1 GPA    | -  | -  | -  | 1 
Adm Type     | -  | -  | -  | - 
Gender       | 10 | 9  | 10 | 5 
Ethnicity    | -  | -  | -  | - 
Income Group | -  | -  | -  | 7 
First Gen    | -  | -  | -  | - 
Dormitory    | 3  | 6  | 4  | - 
ePort Logins | 6  | 3  | 7  | 4 
ePort Subm   | -  | -  | -  | 10 
ePort Hits   | 1  | 1  | 1  | 2 


Several interesting observations can be derived from these results. First, we emphasize that all but one 
method reported ePort Hits as being the most important feature of the dataset. In other words, there 
appears to be a strong correlation between the number of times a certain student's electronic portfolio 
pages are visited and that student's decision to stay or withdraw from the College of Engineering. Note 
that these hits originate from both external visitors and the students themselves. While the current data 
does not allow us to discern the two scenarios, we suspect that the majority of the hits do in fact come 
from the portfolio owner. If that is indeed the case, this noticeable correlation could be explained simply 
by the fact that students whose portfolios exhibit a larger number of hits are likely to be those who 
spend more time editing their pages, creating new content, and submitting assignments (as those 
actions directly contribute to that student's page hit count). It would then make reasonable sense that 
this display of engagement should, in some cases, correlate with the chances of this particular student 
being retained. 

Further, we noticed that some of the features had no substantial individual correlation to the class 
values. For instance, in our particular context, ethnicity, admission type, first generation status, income, 
and the number of assignments a student sent through his/her ePortfolio did not appear to be closely 
related to that student's retention in the program. As reported by Zhang, Anderson, Ohland, and 
Thorndyke (2004) and by Mendez, Buskirk, Lohr, and Haag (2008), we also observed minor negative 
correlations between verbal SAT scores and engineering student retention. 

To compare the performance of classification models based on traditional academic data to that of 
models based on student engagement features effectively, we created four subsets from the original 
data. These are described below: 

• All-academic: This subset contained all academic and demographic features listed in Table 3. 

• Top-academic: Following the feature selection process described above, this subset contained 
only the top three academic and demographic features. Multiple wrapper methods (i.e., methods 
that score feature subsets rather than individual features alone) were used, and the final subset 
chosen contained the following: Adm Intent, EG 111 grade, and Sem 1 GPA. 

• All-engagement: Contained the three ePortfolio engagement features. 

• Top-academic + engagement: This final subset contained the optimal three-element 
combination of features across all initially available. These were EG 111 grade, ePort logins, and 
ePort hits. 

6. METHODOLOGY 

For this study, we selected a range of classification methods that have been previously utilized in this 
particular domain, or that are suitable to work with imbalanced datasets. Following is a brief description 
of each classifier and the evaluation measurements we use to compare their performance. 

6.1 Classification Methods 

6.1.1 Naive Bayes 

Among the simplest and most primitive classification algorithms, this probabilistic method is based on 
Bayes' theorem (Bayes & Price, 1763) and strong underlying independence assumptions. That is, 
each feature is assumed to contribute independently to the class outcome. In predicting student 
attrition, naive Bayes classifiers have been used by Pittman (2011), Zhang, Oussena, Clark, and Hyensook 
(2010), and Nandeshwar, Menzies, and Nelson (2011). Notably, the best results reported in Zhang et al. 
(2010) were achieved via this method. 
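
The independence assumption can be stated compactly. In the sketch below, $c$ denotes a class (retained or not retained) and $x_1, \dots, x_n$ the feature values of one student; this notation is introduced here for illustration and does not appear in the original. Each class is scored by multiplying its prior by the per-feature likelihoods, and the class with the highest score is predicted:

$$\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$
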

6.1.2 C4.5 Decision trees 


Another very popular classification method, C4.5 decision trees (Quinlan, 1993), has been used to 
predict student retention multiple times in recent literature, for example in Yadav, Bharadwaj, and Pal (2012), 
Nandeshwar, Menzies, and Nelson (2011), Lauría (2012), and Lin (2012). This method works by building 
a tree structure where split operations are performed on each node based on information gain values 
for each feature of the dataset and the respective class. At each level, the attribute with highest 
information gain is chosen as the basis for the split criterion. 
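
To make the split criterion concrete, the following is a minimal sketch of the information gain computation for a categorical feature. The toy labels and the two features (`low_hits`, `gender`) are hypothetical and chosen only to contrast an informative split with an uninformative one. Note that C4.5 itself normalizes this quantity into a gain ratio; only the plain information gain described above is shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting `labels` on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the child nodes produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: 1 = not retained, 0 = retained.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
low_hits = ["yes", "yes", "no", "no", "no", "no", "no", "no"]  # separates the classes
gender = ["f", "m", "f", "m", "f", "m", "f", "m"]              # carries no signal here

gain_low_hits = information_gain(low_hits, labels)
gain_gender = information_gain(gender, labels)
```

In this toy case the split on `low_hits` removes all class uncertainty, so its gain equals the entropy of the labels, while the split on `gender` yields a gain of zero and would not be chosen.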


6.1.3 Logistic regression 

Logistic regression is often used as a classification method wherein a sigmoid function is estimated 
based on the training data, and used to partition the input space into two class-specific regions. Given 
this division, new instances can be easily assigned to one of the two classes. This approach has 
been used to predict student retention, and has often achieved highly accurate results (Luna, 2000; Fike 
& Fike, 2008; Lauría, 2012; Lin, Imbrie, & Reid, 2009; Zhang, Anderson, Ohland, & Thorndyke, 2004; 
Herzog, 2006; Veenstra, Dey, & Herrin, 2009). 
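
In symbols, and using a weight vector $w$ and an intercept $b$ as illustrative notation that is not taken from the original, the fitted model estimates

$$P(\text{not retained} \mid x) = \frac{1}{1 + e^{-(w^{\top} x + b)}}$$

so that, with the usual threshold of 0.5, the boundary between the two class regions is the set of points where $w^{\top} x + b = 0$.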

6.1.4 Hellinger distance decision trees 

When applying learning algorithms to imbalanced datasets, one often needs to supplement the process 
with some form of data sampling technique. Hellinger distance decision trees were proposed as a 
simpler alternative to that (Cieslak, Hoens, Chawla, & Kegelmeyer, 2012). This method uses Hellinger 
distance as the splitting criterion for the tree, which has several advantages over traditional metrics 
such as gain ratio in the context of imbalanced data. To the best of our knowledge, this method has not 
yet been used to predict student retention, but given that our dataset is highly imbalanced, we chose to 
investigate its performance. 
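
As a rough illustration of why this criterion suits imbalanced data, the sketch below computes the Hellinger distance between the two class-conditional distributions induced by a categorical split. Each term compares the fraction of positives and the fraction of negatives that fall into a branch, so the class prior does not enter the formula. The function name and the toy data are hypothetical, and the code is a simplified reading of the criterion rather than the authors' implementation.

```python
import math

def hellinger_split_value(feature_values, labels, positive=1):
    """Hellinger distance between the class-conditional branch distributions.

    Assumes both classes are present in `labels`.
    """
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    total = 0.0
    for value in set(feature_values):
        pos_in_branch = sum(
            1 for x, y in zip(feature_values, labels) if x == value and y == positive
        )
        neg_in_branch = sum(
            1 for x, y in zip(feature_values, labels) if x == value and y != positive
        )
        # Compare the share of each class that lands in this branch.
        total += (math.sqrt(pos_in_branch / n_pos) - math.sqrt(neg_in_branch / n_neg)) ** 2
    return math.sqrt(total)

# Hypothetical toy data: 1 = not retained (minority), 0 = retained.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
low_hits = ["yes", "yes", "no", "no", "no", "no", "no", "no"]
gender = ["f", "m", "f", "m", "f", "m", "f", "m"]

separating = hellinger_split_value(low_hits, labels)
uninformative = hellinger_split_value(gender, labels)
```

A split that perfectly separates the classes reaches the maximum value of the square root of 2 for a two-branch split, whereas a split that distributes both classes identically across branches scores 0.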

6.1.5 Random forests 

Random forests combine multiple tree predictors in an ensemble (Breiman, 2001). New instances being 
classified are pushed down the trees, and each tree reports a classification. The "forest" then decides 
which label to assign to this new instance based on the aggregate number of votes given by the set of 
trees. Recent work by Mendez, Buskirk, Lohr, and Haag (2008) used this method to predict science and 
engineering student persistence. 

6.2 Evaluation Measures 


In order to compare the results obtained by each of the classifiers, as well as the four different data 
"subsets," we utilized a variety of measures. A very popular standard used to evaluate classifiers is the 
predictive accuracy. Note, however, that utilizing this metric to evaluate classification based on 
imbalanced datasets can be extremely misleading. 

To illustrate this issue, suppose that upon being given our dataset, an imaginary classifier predicts that 
all students will be retained in the engineering program following their first semester enrolled on 
campus. This will result in a remarkable 88.5 percent accuracy (recall that only 11.5 percent of the 
students in this dataset left the program at the end of their first semester). It is obvious, however, that 
such a classifier should not be awarded any merit since it fails to identify any of the students that should have 
been labelled as being at risk. Instead, it is more appropriate to analyze the prediction accuracy for each 
individual class, or to use ROC curves to summarize the classifier performance. These and other 
measures can be calculated using confusion matrices (see Figure 6). 
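
The arithmetic behind this example can be checked with a few lines of code. The sketch assumes a hypothetical cohort of 1,000 students with the 11.5 percent attrition rate quoted above; the cohort size is chosen only to keep the numbers round and is not the size of the actual dataset.

```python
def overall_accuracy(y_true, y_pred):
    """Share of instances whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical cohort: 1 = not retained, 0 = retained.
y_true = [1] * 115 + [0] * 885
# The imaginary classifier predicts "retained" for everyone.
y_pred = [0] * 1000

accuracy = overall_accuracy(y_true, y_pred)
leavers_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
```

Here `accuracy` evaluates to 0.885 while `leavers_found` is 0, which is exactly the situation the text warns about.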


Given the binary nature of this specific classification problem, the corresponding confusion matrix 
reports four values: True Positives (TP) — the number of students not retained correctly classified; True 
Negatives (TN) — the number of retained students accurately classified as such; False Positives (FP) — 
the number of retained students mistakenly classified as not retained; and False Negatives (FN) — not 
retained students wrongfully predicted as retained. Based on these labels, the individual accuracies for 
the negative (retained) and positive [2] (not retained) classes, as well as the classifier's recall rate, can be 
obtained as follows: 


$$\text{accuracy}^{+} = \frac{TP}{TP + FP} \qquad \text{accuracy}^{-} = \frac{TN}{TN + FN} \qquad \text{recall} = \frac{TP}{TP + FN}$$
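
The following minimal sketch derives these quantities from a pair of label lists. It reads the positive-class accuracy as precision, TP / (TP + FP), in line with footnote 2, and takes the negative-class accuracy as TN / (TN + FN). The function names and the ten-student toy example are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN), with `positive` marking not-retained students."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def safe_ratio(numerator, denominator):
    """Avoid division by zero when a classifier never predicts a class."""
    return numerator / denominator if denominator else None

def class_metrics(tp, fp, tn, fn):
    return {
        "accuracy_pos": safe_ratio(tp, tp + fp),  # precision for the positive class
        "accuracy_neg": safe_ratio(tn, tn + fn),
        "recall": safe_ratio(tp, tp + fn),
    }

# Hypothetical toy cohort: three leavers, seven retained students.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = class_metrics(*counts)
```

For this toy input the counts are TP = 2, FP = 1, TN = 6, and FN = 1, so both the positive-class accuracy and the recall are 2/3 and the negative-class accuracy is 6/7.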


As previously mentioned, ROC curves are frequently used to summarize the performance of classifiers 
on imbalanced datasets. On an ROC curve, the x-axis represents the FP rate, FP/(TN + FP), and the 
y-axis denotes the TP rate, given by TP/(TP + FN), at various threshold settings. The area under the ROC 
curve, AUROC, is also a useful metric for comparing different classifiers. The values for AUROC range 
from a low of 0 to a high of 1, which would represent an optimal classifier, as highlighted in Figure 7. 
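
One convenient way to compute AUROC without drawing the curve relies on its equivalence to a ranking statistic: the area equals the probability that a randomly chosen positive (not-retained) instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that equivalence; the function name and example scores are hypothetical, and the quadratic loop is meant for clarity rather than speed.

```python
def auroc(scores, labels, positive=1):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos_scores = [s for s, y in zip(scores, labels) if y == positive]
    neg_scores = [s for s, y in zip(scores, labels) if y != positive]
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

labels = [1, 1, 0, 0, 0]
perfect = auroc([0.9, 0.8, 0.3, 0.2, 0.1], labels)   # every leaver outranks every retained student
inverted = auroc([0.1, 0.2, 0.7, 0.8, 0.9], labels)  # ranking is exactly backwards
constant = auroc([0.5, 0.5, 0.5, 0.5, 0.5], labels)  # scores carry no information
```

The three cases evaluate to 1.0, 0.0, and 0.5 respectively, the last being the value expected from uninformative scores.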


[2] The prediction accuracy for the positive class can also be labelled precision. 

Figure 6: Confusion matrix for our experiments 

Figure 7: ROC curve illustration 

7. EXPERIMENTAL RESULTS 

To estimate how well the models generalize to future datasets, we utilized a 10-fold cross validation 
technique. This consists of splitting the original n data instances into 10 complementary subsets of size 
n/10, each of which preserves the original ratio of minority and majority class instances. The classifier 
is then given 9 of the subsets for training, and validation is performed using the remaining portion of the 
data. This process is repeated for 10 rounds using different partitions at each time, and an overall 
average of the results across each iteration is computed. 
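
The stratified splitting step can be sketched as follows. The helper names are hypothetical, and the label vector simply mirrors the cohort counts reported in this paper (48 not-retained students out of 429); it is not the actual dataset. Indices of each class are shuffled and then dealt round-robin into the folds so that every fold keeps roughly the original class ratio.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Partition instance indices into k folds that preserve the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class across the folds in turn.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold per round."""
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# 1 = not retained, 0 = retained; counts follow the cohort size given in the paper.
labels = [1] * 48 + [0] * 381
folds = stratified_folds(labels, k=10, seed=0)
splits = list(train_test_splits(folds))
```

With these counts each fold receives four or five not-retained students, and the per-round results would then be averaged as described above.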

The performance of each of the five classification methods described in section 6.1 was evaluated as 
they were used to perform prediction on each of the four available datasets. Table 5 displays the results 
of each individual experiment in terms of the prediction accuracy for the negative class instances (i.e., 
the ratio of retained students correctly labelled as retained), the prediction accuracy for the positive 
class instances (i.e., the ratio of not retained students correctly classified as not retained), and the 
overall weighted average across these two accuracy measurements. The highest accuracies achieved for 
each of the datasets are highlighted in bold, while the three highest overall are underlined. 

Before analyzing these results more deeply, it is essential to consider the degree of importance that 
should be assigned to each of these metrics. Given our binary classification problem, two types of errors 
could emerge. Students who ultimately remain in the program for the spring semester could be 
misclassified as not retained (false positives), and actual not-retained students could be mistakenly 
labelled as retained (false negatives). While some previous work (Dekker, Pechenizkiy, & Vleeshouwers, 
2009) considered the first type of error to be more serious, we argue that the opposite is true. If these 
techniques are to be used in the development of an effective early warning system, failing to identify 
students at risk of leaving can be much more costly than incorrectly labelling someone as an early 
leaver. 

In Table 5, we can see that predictions based only on academic performance and demographic data 
achieve a maximum acc+ of 27.5 percent when the all-academic dataset is paired with a naive Bayes 
model, which corresponds to only 11 of the 48 not-retained students being correctly identified. 
Conversely, when engagement features are utilized, that accuracy improves very noticeably to 83.3 
percent and 87.5 percent, both also achieved with the previously mentioned classifier. 

The naive Bayes model, using the top-academic + engagement dataset, remarkably identified 42 of the 
48 dropout students. The vast majority of those retained (331 out of 419) are also correctly classified. 
Note that the other four classifiers obtain higher acc- values under the same setup, and could potentially 
be the preferred choice depending on the circumstances. 

With respect to acc+, the naive Bayes classifier outperformed the others for all but one dataset. We 
used its experimental results in Figure 8 to illustrate the ROC and Precision-Recall curves for each 
dataset. In our particular context, it seems apparent that the ePortfolio engagement features are very 
good predictors for student retention. The highest AUROC value [3] (0.929) was obtained by the 
top-academic + engagement dataset, while all-academic performed worse with an AUROC of 0.654. 


[3] A different logistic regression implementation using L1 regularization yielded slightly higher AUROC. See Table 6 for details. 

Table 5: Prediction accuracy achieved using each of the datasets 

Dataset                    Classifier   Acc+    Acc-    Acc
all-academic               NB           0.275   0.902   0.842
                           DT           0.083   0.930   0.833
                           LR           0.104   0.900   0.803
                           HT           0.083   0.884   0.792
                           RF           0.000   0.987   0.874
top-academic               NB           0.167   0.954   0.864
                           DT           0.042   0.949   0.845
                           LR           0.000   0.981   0.869
                           HT           0.250   0.881   0.809
                           RF           0.104   0.892   0.802
all-engagement             NB           0.833   0.879   0.874
                           DT           0.771   0.970   0.947
                           LR           0.771   0.978   0.955
                           HT           0.771   0.962   0.940
                           RF           0.771   0.970   0.947
top-academic+engagement    NB           0.875   0.892   0.890
                           DT           0.792   0.962   0.945
                           LR           0.750   0.973   0.947
                           HT           0.771   0.965   0.943
                           RF           0.750   0.965   0.940

[Figure 8 panels: ROC curve comparison between different feature sets (x-axis: false positive rate); PR curve comparison between different feature sets] 

Figure 8: Naive Bayes ROC and Precision-Recall 


7.1 Improving the Detection of at-risk Students with Data Oversampling 

As highlighted in section 5.1, the particular cohort of students utilized for our experiments had a very 
high retention rate of 88.5 percent. With such data imbalance, classification models can often learn 
hypotheses that are biased towards assigning the majority class label to new instances. One way to 
address this issue is to introduce, during the training phase of the models, "synthetic" examples of 
minority class instances. This can essentially decrease the rate of imbalance and it often lessens the bias 
towards the majority class. Previous work by Thammasiri et al. (2013) has investigated the effect of 
various over-sampling techniques on models designed to predict student attrition. One such method 
that was found to improve the overall performance was SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 
2002), which generates synthetic instances by interpolating between existing minority class 
occurrences and their k-nearest neighbours of the same class. 
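
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration for numeric features, not the reference implementation: the function name, parameters, and the four-point minority set are hypothetical. For every minority instance, a neighbour is drawn from its k nearest minority neighbours and a new point is placed at a random position on the segment joining the two.

```python
import random

def smote(minority, n_per_sample=1, k=5, seed=0):
    """Create `n_per_sample` synthetic points per minority instance by interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours by squared Euclidean distance, excluding x itself.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbours = others[:k]
        for _ in range(n_per_sample):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to the neighbour
            synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Hypothetical minority instances with two numeric features each.
minority = [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0], [4.0, 1.0]]
synthetic = smote(minority, n_per_sample=2, k=2, seed=0)
```

Under this sketch, `n_per_sample=1` would correspond to adding as many synthetic minority instances as there are observed ones, and `n_per_sample=2` to adding two per observed instance. Synthetic instances should only be added to the training portion of each fold, never to the data used for validation.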

To evaluate any benefits of data oversampling in our context, we used the top-academic + engagement 
dataset previously described, and repeated the experiments while introducing synthetic instances of the 
not-retained class. Table 6 illustrates the performances of our models in terms of AUROC when varying 
amounts of synthetic minority instances are used. There, "100 SMOTE" refers to the addition of a 
synthetic set of minority instances that contains the same number of samples as the original set, "200 
SMOTE" creates two new samples for each observed minority instance, and so on. As we can see, most 
models exhibit marginal improvements when SMOTE instances are used to decrease the class imbalance 
during model training. 

Table 6: Classification models' AUROC with data 

Classifier   No SMOTE   100 SMOTE   200 SMOTE   300 SMOTE
NB           0.923      0.923       0.924       0.926
DT           0.858      0.866       0.855       0.849
LR           0.968      0.969       0.970       0.971
RF           0.953      0.960       0.956       0.941


7.2 Alternative Model Evaluation 

The previous sections provided a detailed comparison between the performances obtained by 
each model with respect to a variety of traditionally used evaluation metrics. When considering the 
real-world requirements of implementing an early warning system, wherein interventions are put in 
place to help students labelled as being at risk, it is often more logical to evaluate our predictive models 
differently. Suppose, for instance, that a school is given a certain operational budget to deploy an 
intervention program for STEM students at risk of not being retained. Given this particular constraint, it 
might not be possible for the school administration to hold individual one-on-one interventions with all 
students at risk. Hence, they should ideally choose a predictive model that can most accurately identify 
which students are at highest risk. 

The models described in section 6.1 can, in addition to providing a binary label to a new student being 
evaluated, also output an associated probability value between 0 and 1 that corresponds to that model's 
estimate of the likelihood of this student not being retained. Based on a set threshold value in that same 
range, the model then decides if the student is to be labelled as likely to be retained or otherwise. 
Contextually, if a predictive model using a threshold of 0.5 generates a probability of 0.6 for student A 
and 0.99 for student B, they will both receive a "not retained" label. However, chances are that student 
B requires more immediate attention. 

If we generate a probability value for each student in a given cohort, we can then create an overall 
sorted rank with respect to these values. In this setting, a preferred model would be one that can 
generate high probabilities for those students indeed at risk and lower probabilities for those on track to 
succeeding. Next, we compare the performances of four of our models in terms of how precise they are 
on the top 5 and 10 percent of the probability scores they generate [4]. That is, if we isolate the 5 and 10 
percent of students who receive the highest scores from each model, we would like to know what 
fraction of that group eventually left engineering. This metric is especially helpful in the context of early 
warning systems because when working with time and budgetary constraints, one will often need to 
choose a small subset of students on which to focus the appropriate intervention. 
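
The threshold and ranking ideas above can be sketched as follows. The function names are hypothetical, and the cohort is synthetic: it is arranged so that 21 of the 22 highest-scored students are leavers, purely to show how the metric is computed. Rounding the group size up is an assumption, chosen because it gives the 22 students that 5 percent of a 429-student cohort corresponds to in this paper.

```python
import math

def label_students(probabilities, threshold=0.5):
    """Turn not-retained probabilities into binary labels using a fixed threshold."""
    return ["not retained" if p >= threshold else "retained" for p in probabilities]

def precision_at_top(probabilities, left_program, fraction):
    """Fraction of actual leavers among the top `fraction` of students by score."""
    k = math.ceil(len(probabilities) * fraction)  # assumption: round the group size up
    ranked = sorted(zip(probabilities, left_program), key=lambda pair: pair[0], reverse=True)
    return sum(1 for _, left in ranked[:k] if left) / k

# Students A and B from the text: same label, very different urgency.
labels_ab = label_students([0.6, 0.99])

# Synthetic cohort of 429 scored students with 48 leavers in total.
scores = [(429 - i) / 430 for i in range(429)]
left = [True] * 21 + [False] + [True] * 27 + [False] * 380
top5 = precision_at_top(scores, left, 0.05)
```

Both students receive the "not retained" label, yet only the ranking distinguishes them. For the synthetic cohort, `top5` equals 21/22, or roughly 0.954.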

[4] A 10-fold cross validation setup was utilized to score all students. 

As seen in Table 7, the models that include engagement features outperform those that only consider 
academic performance when we look at precisions on both the top 5 and 10 percent. Further, we 
highlight that the results obtained by the logistic regression model that only used engagement features 
were the highest observed. In our specific scenario, the top 5 percent accounted for the 22 students 
whose probability scores were highest overall. Using a logistic regression (or decision tree) model 
coupled with the engagement dataset to select the 22 students at "highest risk" gives us a precision of 
0.954, which means that 21 of the 22 selected students were in fact not retained. 

When pairing at-risk students with the appropriate interventions based on flags generated by early 
warning systems, we might not necessarily give preference to models that can be highly accurate, but 
we should rather evaluate their ability to select those students who need the most immediate attention. 
Though we had previously shown that the naive Bayes model is the most accurate in the aggregate, here 
we show that other models, in fact, might provide a more appropriate solution from an operational 
standpoint. 


Table 7: Precision on the top 5 and 10% 

Dataset                    Classifier   Top 5%   Top 10%
all-academic               NB           0.400    0.268
                           DT           0.200    0.243
                           LR           0.400    0.341
                           RF           0.350    0.292
top-academic               NB           0.300    0.268
                           DT           0.250    0.244
                           LR           0.400    0.268
                           RF           0.150    0.170
all-engagement             NB           0.800    0.756
                           DT           0.954    0.756
                           LR           0.954    0.878
                           RF           0.900    0.829
top-academic+engagement    NB           0.650    0.682
                           DT           0.900    0.756
                           LR           0.900    0.878
                           RF           0.900    0.829

8. CONCLUSION AND FUTURE WORK 

Given a course with an enrollment of 429 first-semester, first-year Intro to Engineering students and an 
average retention rate of 85 percent, the challenge was to develop, train, and test classification models 
to predict whether students would persist into the second semester beyond traditional measures of 
performance. How do you identify not just the high interest/low ability at risk students but also the low 
interest/high ability students? More specifically, we wanted to compare the performances of models 
based on traditional academic data (SAT scores, GPA, demographics, etc.) and models based on data 
extracted from ePortfolios (#logins, #hits, etc.). The basic premise is that while early warning 
systems based on academic performance data alone can be very good at finding students struggling in 
class and alerting their advisors, early signs of a possible decision to drop out of a certain academic path 
can be very difficult to find when dealing with students who do not display any changes in their 
class performance. 

We investigated the feasibility of using ePortfolio data as a proxy for measuring student engagement and 
showed that these particular variables can be highly predictive of college retention outcomes. We 
found that while datasets that do not contain features that quantify student academic engagement 
can often yield reasonable results, providing such features to the classification models greatly increases 
their ability to identify students who may ultimately leave. Our experiments showed significant gains in 
accuracy when engagement features were utilized, and we believe this can be used to build early 
warning systems that would be able to identify at-risk students at very early stages of their academic 
life, giving educators the opportunity to intervene in a more timely and effective fashion. Our key 
findings included the following: 

• Out of a set of several academic performance, demographic, and ePortfolio features, the 
number of ePortfolio hits displayed, based on multiple metrics, the strongest correlation values 
to outcome (student was retained/not retained). 

• The performance of the prediction models that used only ePortfolio data was consistently better 
than that of models based on academic performance data alone. 

• The best prediction results were obtained by using a subset of the data containing the following 
features: EG 111 grade, ePort logins, and ePort hits. 

• Using only academic performance data and a leave-one-out cross validation setup, we were able 
to identify 11 of the 48 students not retained past their first semester (out of the 429 students 
in the course). By adding the ePortfolio features, our model's performance dramatically 
improved and we were able to label 42 of the 48 students correctly while incurring very minor 
losses in accuracy regarding the retained group. 

While the results presented in this paper were obtained using an aggregated dataset collected at the 
end of the fall semester of 2012, we have since begun collecting ePortfolio usage data on a daily basis. 
As we have shown for this particular cohort of engineering students, low engagement with this tool can 
work as an indicator that a certain student might not persist in engineering coursework in the following 
semester. With that in mind, we plan to investigate how soon after the beginning of the semester one 
can effectively determine that a student is "disengaging" and what interventions can be conducted 
during the semester to attempt to increase engagement levels before the course ends. 

Furthermore, we have begun to compare, combine, and triangulate traditional learning analytics data 
from Sakai, the campus learning management system, with the academic, demographic, and 
engagement data sets. Other programs in the College of Science are also attempting to replicate and 
scale the ePortfolio design, utilization, and analytics gained from this Intro to Engineering course. 

While for the most part we can measure how much time and energy a student puts into his/her 
ePortfolio by looking at the quantitative metrics discussed in this paper, there may also be substantial 
knowledge to be gained from understanding the qualitative data within the ePortfolio. To that end, we 
have also started to explore insights that can be harvested by text mining the actual content of students' 
ePortfolios. Lastly, we intend to condense this predictive analysis and students' engagement trajectories 
over time into an interactive dashboard that can be used by academic advisors who will then be able to 
track student progress and decide who is in need of individual attention and when that intervention 
should take place. 